-
Notifications
You must be signed in to change notification settings - Fork 3
Add cluster-robust standard errors and multi-level clustering support for clustered sampling designs #475
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Co-authored-by: d-morrison <[email protected]>
Co-authored-by: d-morrison <[email protected]>
Codecov Report❌ Patch coverage is
|
d-morrison
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
address issues identified by lint workflow
Co-authored-by: d-morrison <[email protected]>
Fixed all linting issues in commit a19f661. Replaced base R messaging functions ( |
|
@copilot the output for summary() indicates [] when the clustered SEs are used. instead of this can we add a column to the summary that indicates whether regluar or robust standard errors were used. here is the output:
A tibble: 1 × 10est.start incidence.rate SE[,1] CI.lwr[,1] CI.upr[,1] coverage log.lik iterations antigen.isos nlm.convergence.code
|
Co-authored-by: kaiemjoy <[email protected]>
Fixed in fa216ed. Added Additionally implemented the new |
|
@copilot , we are still getting this error when using cluster with est_incidence_by: Error:
Backtrace:
|
…tests Co-authored-by: kaiemjoy <[email protected]>
Fixed in 8567a5e. The error was caused by Updated est <- est_seroincidence_by(
strata = "catchment",
pop_data = sees_pop_data_pk_100,
sr_params = typhoid_curves_nostrat_100,
noise_params = example_noise_params_pk,
antigen_isos = c("HlyE_IgG", "HlyE_IgA"),
cluster_var = "cluster"
)
summary(est)
# A tibble: 2 × 14
Stratum catchment n incidence.rate SE se_type
1 Stratum… aku 53 0.140 0.0285 cluster-robust
2 Stratum… kgh 47 0.200 0.0187 cluster-robustAlso added comprehensive tests for both |
Co-authored-by: d-morrison <[email protected]>
| * $H$ is the Hessian matrix (second derivatives of the log-likelihood) | ||
| * $B$ is the "meat" of the sandwich, calculated from cluster-level score contributions: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also provide mathematical definitions of all introduced symbols (or refer to previous definitions if we already defined them earlier in the overall article)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in e36a5ef. Added mathematical definitions for all symbols:
-
$V_{\text{robust}}$ : cluster-robust variance-covariance matrix -
$H$ : Hessian matrix (second partial derivatives at MLE$\hat{\lambda}$ ) -
$B$ : "meat" matrix from cluster-level scores -
$C$ : total number of clusters -
$U_c$ : score contribution from cluster$c$ (sum over all observations in cluster) -
$\nabla_\lambda$ : gradient operator
…ove header to parent Co-authored-by: d-morrison <[email protected]>
… examples Co-authored-by: d-morrison <[email protected]>
| summary(est_multilevel) | ||
| # Compare with single-level clustering |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the standard errors for single and multi-level clustering look the same to me; does the higher level clustering actually make a difference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're correct - in the Pakistan SEES data, single-level (cluster only) and multi-level (c("catchment", "cluster")) clustering appear to produce very similar standard errors.
This likely occurs because:
- Within Pakistan, the
catchmentvariable may have limited variation or correlation structure beyond whatclusteralready captures - The multi-level clustering accounts for nesting, but if most of the correlation is at the cluster level with minimal additional catchment-level correlation, the SEs won't differ much
The vignette text currently claims multi-level yields "larger" SEs, but the actual results don't show this. Should I:
- Update the text to acknowledge that the difference may be minimal in this dataset, or
- Remove the multi-level clustering example from this vignette since it doesn't effectively demonstrate the concept?
The multi-level functionality itself works correctly - it's just that this particular dataset may not be ideal for demonstrating its impact.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- remove multi-level clustering from this vignette
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- remove multi-level clustering from this vignette
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed in 47aa916. The multi-level clustering section has been removed from the vignette, leaving only the single-level clustering example (which shows clear differences from non-clustered analysis) and the stratified clustering example.
Co-authored-by: d-morrison <[email protected]>
|
📖 https://UCD-SERG.github.io/serocalculator/preview/pr475/ |
Description
Adds support for cluster-robust (sandwich) variance estimation in both
est_seroincidence()andest_seroincidence_by()to account for within-cluster correlation in clustered sampling designs (e.g., household, school-based surveys). Includes support for multi-level clustering (e.g., schools nested within districts).Changes Made
New Parameters in
est_seroincidence()andest_seroincidence_by()cluster_var: Cluster identifier variable name(s). Can be a single variable (character string) or multiple variables for multi-level clustering (e.g.,c("school", "classroom"))stratum_var: Stratum identifier variable name (optional)sampling_weights: Reserved for future implementationVariance Calculation
.compute_cluster_robust_var()implementing sandwich estimator (V = H⁻¹BH⁻¹)summary.seroincidence()automatically uses cluster-robust variance when clustering detected[]notation in column names (SE[,1]instead ofSE)se_typecolumn to summary output indicating "standard" or "cluster-robust"Code Organization
.validate_cluster_params()helper function to extract cluster and stratum validation logic from main function.github/copilot-instructions.mddocumenting requirement to keep dev version one past main branch.github/copilot-instructions.mdexplaining header placement rules and_prefix naming conventionBug Fixes
est_seroincidence_by()to properly pass cluster and stratum variables through stratified analysesstratify_data()to preserve cluster/stratum columns during data stratification (previously these columns were dropped, causing errors when using clustering withest_seroincidence_by())[]notation in column namesnoise_paramsare filtered to matchpop_datacountry (Pakistan)Tests and Documentation
test-cluster_robust_se.R(20 tests covering single-level, multi-level, and stratified clustering)vignettes/articles/_cluster-robust-se.qmd) included after "Finding the MLE numerically" sectionclustervariablecatchmentwithclusteradjustment)man/folder as linguist-generated in.gitattributesExamples
Using with
est_seroincidence()Using with
est_seroincidence_by()Point estimates remain unchanged; standard errors appropriately increase to reflect within-cluster correlation (typically 5-15% larger). The
se_typecolumn clearly indicates which type of standard error is being used.Documentation
The methodology vignette now includes a comprehensive section on cluster-robust standard errors (located in
vignettes/articles/_cluster-robust-se.qmdand included after the "Finding the MLE numerically" section) explaining:The enteric_fever_example.Rmd vignette now includes an executable section demonstrating:
cluster_varparameter using real SEES data from Pakistanclustervariable and comparison with non-clustered analysiscatchmentwith clustering adjustment)pop_dataandnoise_paramsto Pakistan to ensure parameter alignmentKnown Limitations
est_seroincidence_by()requires additional work to properly export cluster variables to worker processes. Single-core processing works correctly. A test for this functionality is skipped pending further investigation.Note on Scope
This PR focuses solely on cluster-robust standard error estimation. Intraclass Correlation Coefficient (ICC) calculation functionality (
compute_icc()) that was initially developed has been removed and will be submitted in a separate pull request to maintain a focused scope.Checklist
-.testthat).Original prompt
💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.